I am looking forward to the course for two reasons:
Another initial impression is that the Moodle page is rather verbose and chaotic, which hampers finding the necessary information and the assigned tasks. I hope that I will get used to this and will not miss any compulsory assignment.
I found the course in Sisu because I still needed one credit for my Transferable Skills section.
I found the book interesting and well written. Naturally, due to my previous experience in R, I was only skim-reading it. Nevertheless, I found some useful functionality that I had not used before, e.g.:

filter(year == 2020 | disease == "COVID19")

gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp, fill = country)) +
  geom_col(position = "dodge") + theme(legend.position = "none")

The percent() function in tidyverse is much simpler than sprintf("%2.1f%%"). With plotly and the HTML output of R Markdown one can create interactive plots (one can hover over the diagram below to highlight countries):

plot = gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, label = country)) +
  geom_point()
ggplotly(plot)
Describe the work you have done this week and summarize your learning.
We got data on questionnaire responses from individuals together with their results on some sort of test (exam). The questionnaire seems to refer to study habits. The origin of the data cannot be clearly identified from the metadata provided.
# Read the original raw data
df = as_tibble(read.csv('http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt', sep = '\t'))
# Read the wrangled data back and harmonize the column names
learning2014 = read_csv('data/learning2014.csv') %>%
  rename(points = Points, attitude = Attitude)
The original raw data consists of 183 records and 60 variables. After summarizing some questions into groups (such as the questions referring to in-depth studying into a "deep" group) by taking their mean, and removing observations with 0 exam points, the processed data consists of 166 observations and 7 variables. Information on the data structure can be found below.
str(learning2014)
## spec_tbl_df [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ Age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. Age = col_double(),
## .. Attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. Points = col_double()
## .. )
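The question grouping mentioned above can be sketched roughly as follows (the question names and answer values here are hypothetical, not the actual ones from the wrangling script):

```r
library(dplyr)

# Toy answers to three hypothetical "deep learning" questions
raw <- tibble(D03 = c(4, 2), D11 = c(5, 3), D19 = c(3, 3))
deep_questions <- c("D03", "D11", "D19")
# The group variable is the row-wise mean of its member questions
raw <- raw %>% mutate(deep = rowMeans(across(all_of(deep_questions))))
raw$deep  # row-wise means: 4 and ~2.67
```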
Let’s have a look at the distribution of the variables in this data by plotting them pairwise in a scatter plot matrix. The figure below provides detailed insight into the data. We suspect that there are differences between the male and female participants, hence we stratify the data into these two groups by color.
colMap = c("F" = "#FF5555", "M" = "#1100EE")
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(color = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor")) ) +
scale_color_manual(values = colMap) +
scale_fill_manual(values = colMap)
We can read several pieces of information from the plot. First, there were almost 1.5 times more females (red) in the study than males (blue), as can be seen from the gender bar chart. Based on the boxplots in the top row, the males' attitude towards statistics was higher; however, there was no huge difference in exam points between males and females. Nevertheless, the positive correlation between attitude and points is observable for both sexes.
In a linear regression analysis we search for a linear relationship between predictor (feature) variables and a target outcome. Let's pick the 3 variables that are the most likely candidates based on the previous plot:
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The results show that attitude is significantly useful for predicting the exam points, as the p-value for the slope of this variable is very low (\(1.93 \cdot 10^{-8}\)). The F-statistic of the test also indicates that at least one of the selected variables has a non-zero slope.
The t-test for each variable has the null hypothesis that the slope of that variable in the regression equation is 0. If that is true, there is no association between the variable and the target. If the p-value is low, the result is unlikely under the null hypothesis, so we usually reject it, i.e. we believe that there is an association.
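As a quick sanity check, the t value reported in the summary above is simply the estimate divided by its standard error:

```r
# Numbers copied from the attitude row of the model summary above
estimate  <- 3.3952
std_error <- 0.5741
t_value   <- estimate / std_error
round(t_value, 2)  # 5.91, matching the reported t value up to rounding
```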
So let's fit the model again with attitude only (as that was the only variable showing significance):
my_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The summary of this model shows that there is a positive association (with a slope of about 3.5) between the attitude score and the exam points. In lay terms this means that a 1 unit increase in attitude corresponds to roughly a 3.5 point increase in the exam results. In addition there is a baseline of 11.6 points for those with the lowest attitude. The multiple R-squared value gives the proportion of variation that is explained by the variables in the model. In this case it is 0.1906, so a little less than 20% of the variation is explained by the explanatory variable, meaning that there are other important factors affecting the exam results that we are omitting here.
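To make this concrete, the fitted line can be evaluated by hand (coefficients copied from the summary above, the attitude value picked arbitrarily):

```r
intercept <- 11.6372          # from the model summary above
slope     <- 3.5255
attitude  <- 4.0              # a hypothetical student
intercept + slope * attitude  # about 25.7 predicted exam points
```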
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
The linear regression model assumes that the observations are produced by a linear function of the explanatory variables plus additive, normally distributed noise with zero mean. The residuals represent this noise, hence it is good to check how normal they are, and whether they are independent of the fitted values (otherwise the noise would depend on the observation). The first two plots address this. In the first one we can see that the residuals are evenly distributed over the fitted values. The Q-Q plot shows that the quantiles of the residual distribution are very close to the quantiles of a normal distribution. The last "Residuals vs. Leverage" plot indicates which observations may have the greatest effect on the parameters (linear regression is known to be sensitive to outliers), hence it could be useful to check whether the model improves a lot by removing those observations.
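Formally, the assumptions above can be written as

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),\]

so the residuals estimate the noise terms \(\varepsilon_i\), and the diagnostic plots check their normality and their independence of the fitted values.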
Beyond explaining the data, linear regression models can also be used to predict unobserved results, as below.
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)
# Print out the new data
new_data
## attitude
## Mia 3.8
## Mike 4.4
## Riikka 2.2
## Pekka 2.9
# Predict the new students' exam points based on attitude
predict(my_model, newdata = new_data)
## Mia Mike Riikka Pekka
## 25.03390 27.14918 19.39317 21.86099
As one can see, the students with higher attitude scores are predicted to achieve higher exam points.
We got data on student performance at school, including grades, the number of previous class failures, and connected background information such as family status, alcohol consumption, etc. We had information from 2 classes: Math and Portuguese. To create a joint dataset we averaged and rounded the grades from these 2 classes and included only students that were present in both classes (at least based on their attributes; the table does not contain unique student identifiers, which can cause confusion).
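The joining logic described above can be sketched on toy data (the column and student names here are made up for illustration; the real script joins on all shared background variables):

```r
library(dplyr)

# Two toy class tables; only "Ana" and "Ben" appear in both
math <- tibble(name = c("Ana", "Ben"),        G3 = c(10, 16))
por  <- tibble(name = c("Ana", "Ben", "Cem"), G3 = c(12, 14, 9))

# Keep only students present in both classes, then round-average the grades
joint <- inner_join(math, por, by = "name", suffix = c("_math", "_por")) %>%
  mutate(G3 = round((G3_math + G3_por) / 2))
joint$G3  # 11 and 15: the round-averaged grades; "Cem" was dropped
```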
# Double check the data wrangling result
df_dc = as_tibble(read.csv('https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv', sep = ','))
df = read_csv('data/student_joint.csv')
#full_join(df,df_dc)
# OK, this is of the same size, i.e. the tables must agree.
After these operations there are 370 students with 35 variables in the data.
str(df)
## spec_tbl_df [370 x 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr [1:370] "R" "R" "R" "R" ...
## $ famsize : chr [1:370] "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr [1:370] "T" "T" "T" "T" ...
## $ Medu : num [1:370] 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : num [1:370] 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr [1:370] "at_home" "other" "at_home" "services" ...
## $ Fjob : chr [1:370] "other" "other" "other" "health" ...
## $ reason : chr [1:370] "home" "reputation" "reputation" "course" ...
## $ guardian : chr [1:370] "mother" "mother" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : num [1:370] 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ famsup : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ activities: chr [1:370] "yes" "no" "yes" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "yes" "yes" "no" "yes" ...
## $ romantic : chr [1:370] "no" "yes" "no" "no" ...
## $ famrel : num [1:370] 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : num [1:370] 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : num [1:370] 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : num [1:370] 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : num [1:370] 1 4 1 1 3 1 2 3 3 1 ...
## $ health : num [1:370] 1 5 2 5 3 5 5 4 3 4 ...
## $ failures : num [1:370] 0 1 0 0 1 0 1 0 0 0 ...
## $ absences : num [1:370] 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : num [1:370] 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : num [1:370] 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : num [1:370] 12 8 12 9 12 12 6 10 16 10 ...
## $ paid : chr [1:370] "yes" "no" "yes" "yes" ...
## $ alc_use : num [1:370] 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi [1:370] FALSE TRUE FALSE FALSE TRUE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. paid = col_character(),
## .. alc_use = col_double(),
## .. high_use = col_logical()
## .. )
Below I picked 4 variables to examine whether they correlate with high alcohol consumption, and I explain my a priori expectations for each:
- studytime: likely the less someone studies, the more time is available for consuming alcohol. It is unlikely that someone would use alcohol to mitigate stress coming from too much studying. I expect a negative correlation.
- activities: the more activities someone does, the less time they have for hanging out with bad student groups; I again expect a negative correlation.
- higher (meaning plans for higher education): students with clearer future plans are less likely to be heavy alcohol consumers.
- freetime: students with too much free time will probably get bored more, hence consume more alcohol to make their life exciting.

p1 = df %>% ggplot() + aes(x = high_use, y = studytime, fill = high_use) + geom_violin()
p2 = df %>% ggplot() + aes(x = activities, fill = high_use) + geom_bar(position = "fill")
p3 = df %>% ggplot() + aes(x = higher, fill = high_use) + geom_bar(position = "fill")
p4 = df %>% ggplot() + aes(x = high_use, y = freetime, fill = high_use) + geom_violin()
grid.arrange(p1,p2,p3,p4)
m = glm(high_use ~ studytime + activities + higher + freetime, data = df, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ studytime + activities + higher + freetime,
## family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6471 -0.8668 -0.6585 1.1902 2.1362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0841 0.7071 -0.119 0.905327
## studytime -0.5273 0.1577 -3.344 0.000826 ***
## activitiesyes -0.2283 0.2400 -0.951 0.341541
## higheryes -0.7545 0.5398 -1.398 0.162162
## freetime 0.3340 0.1246 2.680 0.007359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 423.64 on 365 degrees of freedom
## AIC: 433.64
##
## Number of Fisher Scoring iterations: 4
The model suggests that the studytime and freetime variables are statistically significant predictors of high/low alcohol usage. It is interesting that the higher education plan is not significant, but I double-checked the data, and it is severely skewed towards students planning higher education (16 students not planning vs. 354 planning). With so few students in one group, this variable has little power to predict high alcohol consumption. The categorical variables (activities and higher) were internally converted to "dummy" indicator levels. As there were only 2 levels in both cases, only 1 dummy variable was created for each.
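Formally, the fitted logistic model expresses the log-odds of high consumption as a linear combination of the predictors:

\[\log \frac{p}{1-p} = \beta_0 + \beta_1 \cdot \mathrm{studytime} + \beta_2 \cdot \mathrm{activities} + \beta_3 \cdot \mathrm{higher} + \beta_4 \cdot \mathrm{freetime},\]

where \(p\) is the probability of high_use. Exponentiating a coefficient therefore gives the odds ratio for a unit change in that variable.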
coef(m)
## (Intercept) studytime activitiesyes higheryes freetime
## -0.08410415 -0.52727634 -0.22825310 -0.75452928 0.33399121
exp(coef(m))
## (Intercept) studytime activitiesyes higheryes freetime
## 0.9193355 0.5902103 0.7959228 0.4702319 1.3965309
The sign of a coefficient represents the direction of the association between the variable and the outcome, i.e. a negative coefficient means that a higher value (e.g. of studytime) results in a lower probability of the outcome (high alcohol usage). In exponential form the coefficients represent odds ratios for a unit change, in other words they quantify the variable's influence in the model. Values close to 1 mean little effect; deviations in either direction show that the variable is important for the prediction.
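As a sanity check, the odds ratios above are just the exponentiated coefficients, e.g. for studytime:

```r
beta_studytime <- -0.52727634   # coefficient from coef(m) above
exp(beta_studytime)             # ~0.59: each extra unit of studytime
                                # multiplies the odds of high_use by ~0.59
```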
confint(m)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.47694251 1.3160333
## studytime -0.84575068 -0.2261421
## activitiesyes -0.70031246 0.2420972
## higheryes -1.84750298 0.3028107
## freetime 0.09312641 0.5827670
These are 95% confidence intervals. It is usually understood that if an interval does not contain 0 then the result is significant, and indeed this matches the significance table of the model. The results match my hypotheses, except for the higher education variable, but that is explained by the skew discussed above.
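The profiled intervals above are close to the simpler Wald approximation, estimate ± 1.96 standard errors (numbers copied from the model summary; this is a sketch, not the profiling method confint() actually uses):

```r
est <- -0.5273   # studytime estimate from the summary above
se  <- 0.1577    # its standard error
c(est - 1.96 * se, est + 1.96 * se)  # roughly (-0.836, -0.218),
                                     # close to the profiled (-0.846, -0.226)
```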
# Use only the significant variables
m2 = glm(high_use ~ studytime + freetime, data = df, family = "binomial")
probabilities <- predict(m2, type = "response")
df %<>% mutate(probability = probabilities)
df %<>% mutate(prediction = if_else(probability>0.5,TRUE,FALSE))
# tabulate the target variable versus the predictions
tt = table(high_use = df$high_use, prediction = df$prediction)
tt
## prediction
## high_use FALSE TRUE
## FALSE 250 9
## TRUE 102 9
This table is also called the confusion matrix. A usual metric calculated from it is precision (TP/(TP+FP)), which is 0.5; this is pretty bad. Another common metric is recall (TP/(TP+FN)), which is 0.081, i.e. almost 0. Accuracy (correct predictions / total predictions) was a bit better: 0.70. The training error is naturally 1 - accuracy, i.e. 0.30. This model performs pretty badly at predicting the high alcohol users, but pretty well at identifying the non-high consumers. Simple guessing (without any other knowledge) should give an accuracy around 0.5, hence this model is still a little better than a random guess.
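The metrics quoted above can be recomputed directly from the confusion matrix counts:

```r
# Counts read from the confusion matrix above
TN <- 250; FP <- 9; FN <- 102; TP <- 9
precision      <- TP / (TP + FP)                   # 0.5
recall         <- TP / (TP + FN)                   # ~0.081
accuracy       <- (TP + TN) / (TP + TN + FP + FN)  # 0.70
training_error <- 1 - accuracy                     # 0.30
```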
# load the data
data("Boston")
df = as_tibble(Boston)
In this exercise we are working with a dataset that is readily available in the MASS (Modern Applied Statistics with S) package of R and contains socio-economic features of Boston suburbs. More info on the data can be found at: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html. The raw data consists of 506 records that refer to different suburbs and 14 variables including for instance the crime rate of the town, the average age of buildings etc.
str(df)
## tibble [506 x 14] (S3: tbl_df/tbl/data.frame)
## $ crim : num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num [1:506] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num [1:506] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int [1:506] 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num [1:506] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num [1:506] 6.58 6.42 7.18 7 7.15 ...
## $ age : num [1:506] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num [1:506] 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int [1:506] 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num [1:506] 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num [1:506] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num [1:506] 397 397 393 395 397 ...
## $ lstat : num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Let’s have a look at the distribution of variables in this data by plotting them pairwise on a scatter plot.
p = ggpairs(df, lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 2)) ) +
theme(text = element_text(size = 8), axis.text.x = element_text(angle = 45))
p
# It's a bit hard to read ggpairs in this case, but it is still more informative than the simple corrplot below.
cor_matrix <- cor(df)
corrplot(cor_matrix, method="circle")
The variables follow various distributions. For instance the crim variable, which encodes the crime rate of the town, seems to follow some sort of power-law distribution, i.e. a few suburbs have very high crime rates while the rest are rather low. On the other hand the rm variable, which encodes the number of rooms, clearly has a normal distribution centered around 6 rooms per dwelling (which feels a bit too high, though). Yet again, indus and tax, which encode the "proportion of non-retail business" (i.e. how industrialized a town is) and the "full-value property tax rate" respectively, seem to have bi-modal distributions, suggesting for instance that the industrialized suburbs separate from the residential quarters.
One of the strongest correlations is related to this industrial variable: it is negatively correlated with the dis variable, which measures the "weighted mean of distances to five Boston employment centres". This makes sense: the industrialized areas (likely also including white-collar businesses) are employment centers, so the less industrialized a district is, the further away it is from these employment centers. The nitrogen oxide concentration is also strongly negatively correlated with this distance, meaning that the areas near the employment centers have high NOx concentrations, i.e. the pollution concentrates around these industrial employment centers.
As all of these variables are continuous, let's scale them to have mean 0 and standard deviation 1:
\[\text{scaled}(x) = \frac{x - \text{mean}(x)}{\text{sd}(x)}\]
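R's built-in scale() function, used below, implements exactly this formula, as a quick check on toy numbers shows:

```r
x <- c(1, 5, 9, 13)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```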
df_scaled = as_tibble(scale(df))
summary(df_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
As expected, the means became 0 (and, although not shown by summary(), the standard deviations became 1).
bins = quantile(df_scaled$crim)
# create a categorical variable 'crime'
df_scaled %<>% mutate(crim = cut(crim, breaks = bins, include.lowest = TRUE))
table(df_scaled$crim)
##
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 127 126 126 127
# The uniform distribution makes sense, as we divided by quantiles.
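The near-uniform class sizes are a direct consequence of cutting at the quantiles; a toy example shows the same behaviour:

```r
x <- 1:100
table(cut(x, breaks = quantile(x), include.lowest = TRUE))
# four bins with 25 values each
```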
# Split the data: 80% train, 20% test
n = nrow(df_scaled)
ind = sample(n, size = n * 0.8)
train = df_scaled[ind,]
test = df_scaled[-ind,]
# Save the correct classes and remove them from the test set
correct_classes = test$crim
test %<>% dplyr::select(-crim)
lda.fit <- lda(crim ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The LDA suggests that the “access to radial highways” rad is positively correlated with the highest crime-rate group.
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 12 15 2 0
## (-0.411,-0.39] 6 13 3 0
## (-0.39,0.00739] 1 11 13 0
## (0.00739,9.92] 0 0 0 26
The confusion matrix shows that this model is pretty good at predicting the correct crime rate: the highest values are found on the diagonal. It is also visible that the highest crime rate class is the easiest to predict (blue on the LDA plot), as the LDA transformation separated that group best from the others. The 3 lower crime rate groups are more mixed, hence more errors are made when predicting them.
# No need to reload the data, just scale it again
df_scaled = as_tibble(scale(as_tibble(Boston)))
dist_eu = dist(df_scaled)
# I'm a bit confused: I don't think this is needed for the k-means function, which
# computes distances on its own, but the assignment requires calculating it.
km = kmeans(df_scaled, centers = 4)
pairs(df_scaled, col = km$cluster)
Not much is visible in these figures. Unfortunately I tend to see 2 clusters, mostly along the black variable, suggesting segregation in this area.
set.seed(123)
# Let's not overdo it, k-means is a fairly expensive technique
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(df_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <- kmeans(df_scaled, centers = 2)
p = ggpairs(df_scaled %>% mutate(cluster = as_factor(km$cluster)), mapping = aes(color = cluster, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 2)) )
p
## Warning in cor(x, y): the standard deviation is zero
Indeed, based on the dropping WCSS (within-cluster sum of squares, a measure of within-cluster coherence; note though that choosing a cluster count is by definition a somewhat subjective exercise), two clusters is a good number for this data, and the clustering reveals segregated societal groups (districts) with clear separation in crime rates and in the lower status of the population (lstat). It's interesting that the nitrogen oxide pollution is actually better (lower) for the group with lower status. The lower-status group obviously pays less tax (likely connected to lower income) and lives in older houses. The most interesting feature is the distribution of the average number of rooms: the joint normal distribution is composed of two somewhat heavy-tailed distributions pointing in opposite directions, the lower-status group with fewer rooms and the upper-status group with more.
I personally like PCA, so let's visualize the clusters with that as well (even though this is not listed as an extra point exercise):
pco = prcomp(df_scaled)
pco$x %>% as_tibble() %>% mutate(cluster = as_factor(km$cluster)) %>% ggplot() +
aes(x = PC1, y = PC2, color = cluster) +
geom_point()
And indeed: PC1 picks up exactly the same difference that was achieved with the clustering; however, it suggests that the separation is not binary in the full data (there are features that mix these societies together).
# load the data
df = read.csv('data/human2.csv', row.names = 1)
# row.names = 1 reads the country-name headers back as row names
ggpairs(df)
summary(df)
## Edu2.FM Labo.FM Edu.Exp Life.exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
In this exercise we are working with countries and data related to their human development. The data consists of 155 observations of 8 variables. The variables are all continuous but have various distributions, e.g. the expected years of education (Edu.Exp) is rather normal, whilst the Gross National Income (GNI) looks more like a power-law distribution. Some variables are positively correlated (e.g. expected years of education and life expectancy) and some negatively (e.g. life expectancy and adolescent birth rate).
pca_human = prcomp(df)
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), xlab = 'Gross national income')
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# Explained variance proportion
sprintf('%.2f%% ',pca_human$sdev^2/sum(pca_human$sdev^2)*100)
## [1] "99.99% " "0.01% " "0.00% " "0.00% " "0.00% " "0.00% " "0.00% "
## [8] "0.00% "
# Can be seen also in
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
df_std = scale(df)
pca_human = prcomp(df_std)
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), xlab = 'Human development', ylab = 'Gender equality')
# Explained variance proportion
sprintf('%.2f%% ',pca_human$sdev^2/sum(pca_human$sdev^2)*100)
## [1] "53.60% " "16.24% " "9.57% " "7.58% " "5.48% " "3.60% " "2.63% "
## [8] "1.30% "
# Can be seen also in
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
The results are clearly different. In the non-standardized case the first principal component accounts for almost 100% of the variation. This component coincides with GNI, as without normalization its values overpower the other variables. In the standardized case, however, the variables contribute more equally to the principal components. Moreover, the plot seems to cluster countries meaningfully, e.g. the Nordic countries are grouped at the top left, and indeed their human development is rather similar. I added my interpretations as axis labels; note, however, that the sign of a principal component is not well defined, so the axes could equally well be mirrored. Specifically, even though I named PC1 "human development" (in the scaled case), human development actually decreases from left to right.
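The dominance of an unscaled, large-variance variable is easy to reproduce on synthetic data (a sketch with made-up columns, not the human data itself):

```r
set.seed(1)
# One low-variance and one high-variance column, analogous to GNI
x <- cbind(small = rnorm(100, sd = 1), big = rnorm(100, sd = 1000))
p <- prcomp(x)
p$sdev^2 / sum(p$sdev^2)  # the first PC captures essentially all the variance
```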
I have already given my interpretation briefly by naming the axes above.
In this exercise we will work with another dataset on human participants’ tea consumption habits.
tea <- as_tibble(read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE))
str(tea)
## tibble [300 x 36] (S3: tbl_df/tbl/data.frame)
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int [1:300] 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
tea_time = tea %>% dplyr::select(c("Tea", "How", "how", "sugar", "where", "lunch"))
#View(tea) # I'd rather not use it in the RMarkdown
# Just visualize
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
The data consists of 300 observations and 36 variables.
mca = MCA(tea_time, graph = TRUE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = TRUE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
If two categories appear close to each other on the variable representation plot, those properties tend to cluster / differentiate the participants in a similar way. This is expanded on the MCA factor map, where one can see that people who drink their tea in tea shops tend to drink unpackaged tea, which makes sense; someone buying from a chain store is more likely to buy tea bags.
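These closeness statements can also be checked numerically; a sketch assuming the `mca` object fitted above (FactoMineR stores the category coordinates in `mca$var$coord`):

```r
# Category coordinates on the first two MCA dimensions; categories with
# similar coordinates are the ones that appear close on the factor map
round(mca$var$coord[, 1:2], 3)
```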
This week we are recreating analysis from Chapter 8 and 9 of the MABS book, but swapping the datasets.
# load the data
rats = read_csv('data/rats_long.csv')
str(rats)
## spec_tbl_df [176 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : num [1:176] 1 1 1 1 1 1 1 1 1 1 ...
## $ Group : num [1:176] 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : chr [1:176] "WD1" "WD8" "WD15" "WD22" ...
## $ Weight: num [1:176] 240 250 255 260 262 258 266 266 265 272 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. Group = col_double(),
## .. Time = col_character(),
## .. Weight = col_double()
## .. )
The rats data contains measurements of the weight of 16 rats over roughly 9 weeks (mostly weekly, but not always exactly once a week), categorized into 3 treatment groups. This yields 11 x 16 = 176 observations altogether. I will use the words treatment, group and diet interchangeably.
Let’s create the factor variables again.
rats %<>% mutate(across(c(ID,Group,Time),as_factor))
# Double check that the Time variable is increasing in the levels
rats$Time
## [1] WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64 WD1 WD8 WD15 WD22
## [16] WD29 WD36 WD43 WD44 WD50 WD57 WD64 WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44
## [31] WD50 WD57 WD64 WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64 WD1
## [46] WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64 WD1 WD8 WD15 WD22 WD29
## [61] WD36 WD43 WD44 WD50 WD57 WD64 WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50
## [76] WD57 WD64 WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64 WD1 WD8
## [91] WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64 WD1 WD8 WD15 WD22 WD29 WD36
## [106] WD43 WD44 WD50 WD57 WD64 WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57
## [121] WD64 WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64 WD1 WD8 WD15
## [136] WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64 WD1 WD8 WD15 WD22 WD29 WD36 WD43
## [151] WD44 WD50 WD57 WD64 WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## [166] WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## Levels: WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
RAW DATA
rats %>% ggplot() +
aes(x = Time, y = Weight, color = ID, group = ID) + geom_line() +
facet_wrap(~Group) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
The facets represent treatments. One can observe that the weights of the rats increase markedly, especially in treatment groups 2 and 3. Even in treatment 1 (which is likely the control group) there is a small increase over time, but note that these rats started from much lower weights, suggesting they were younger (lighter) at the start of the experiment; the slight increase may simply be the general weight gain of rats growing into adulthood.
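The claim about the starting weights is easy to verify from the data; a quick sketch assuming the `rats` tibble loaded above:

```r
# Mean Day 1 weight per treatment group; group 1 starts far lower
rats %>%
  filter(Time == "WD1") %>%
  group_by(Group) %>%
  summarize(startWeight = mean(Weight))
```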
STANDARDIZATION
It is clearly visible in this plot that individual variation makes it hard to compare differences between individuals, so some sort of standardization is needed. I note that I do not think the standardization used in the MABS book is the best choice here. It is clear, for instance, that rats in groups 2 and 3 started from much higher weights, so of course those treatments (which are, in fact, diets) would differ from diet 1. If the rats were raised on similar diets before the experiment started, then I think each subject should be normalized to its own Day 1 weight. I will follow this approach: even though there is no Day 0 observation, it is highly unlikely that the Day 1 differences were caused by a single day on the given diet.
rats %<>% group_by(ID) %>%
mutate(weightRef = Weight[1], stdWeight = Weight/weightRef) %>%
dplyr::select(-weightRef)
rats %>% ggplot() +
aes(x = Time, y = stdWeight, color = ID, group = ID) + geom_line() +
facet_wrap(~Group) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
In this standardized plot every subject starts from its own reference, and we see body weight increase or decrease proportionally to it. The subject with the biggest increase, in treatment 2, gained 20% extra weight during the study. Again, I think this is a fairer comparison than dividing by each observation day's mean if we are interested in the individual effect of the diets.
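For comparison, the MABS-style standardization (scaling within each time point rather than within each rat) would look like the following sketch on the same `rats` tibble; I do not use it in the analysis below:

```r
# MABS-style: z-score each weight against the other rats measured on the same
# day; this removes the between-day trend instead of the between-rat baseline
rats_mabs <- rats %>%
  group_by(Time) %>%
  mutate(stdWeightMABS = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()
```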
Let’s visualize this standardized weight on a box-plot stratified by treatment group.
rats %>% ggplot() +
aes(x = Time, y = stdWeight, color = Group) + geom_boxplot()
This plot reveals that the relative weight gain of group 3 is consistently below that of the other groups. I suspect that with the original standardization this would not be the case (the absolute weights in group 3 are relatively high), and indeed the MABS analysis of this data stated that group 3 grows more than group 2 (which is not true under my interpretation). It is interesting that groups 1 and 2 initially grow similarly, but around Day 50 the growth of group 2 clearly takes over.
SUMMARY MEASURES
Based on the normalized plot and the observations above, a useful summary measure could be the difference between the last and first day's observations (growth towards the eventual value). The delayed response (time to reach a given weight) would also be very interesting, especially to describe the joint early growth of groups 1 and 2, the increased growth of group 2 towards the end, and group 3 seemingly catching up with group 2 at the end. However, since some rats never gain more than 10% of their weight, such a measure would be tedious to define (though in a real data analysis I would do it). Regression slopes could be used too, but let's keep those for part II. Accordingly, let's move on with the difference measure.
rats_summ = rats %>% mutate(diffWeight = stdWeight[Time == "WD64"] - stdWeight[Time == "WD1"]) %>%
summarize(Group = unique(Group), diffWeight = mean(diffWeight))
rats_summ %>% ggplot() + aes(x = Group, color = Group, y = diffWeight) + geom_boxplot() +
ylab("Relative weight gain WD64-WD1")
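The time-to-threshold measure dismissed earlier could be sketched as follows, assuming the `rats` tibble from above; note the `NA`s for rats that never reach +10%, which is exactly why the measure is tedious here:

```r
# First measurement day on which each rat has gained at least 10% over Day 1
rats %>%
  group_by(ID) %>%
  summarize(firstOver10 = first(as.character(Time)[stdWeight >= 1.10],
                                default = NA_character_))
```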
OUTLIER REMOVAL
I am not a big fan of outlier removal (it sounds like fishing in the data for a desired effect), so I will not perform it here. In any case the boxplots above are not indicative of large outlier effects; the only outlying point is in group 1, namely
rats_summ %>% filter(Group == 1, diffWeight>0.15)
## # A tibble: 1 x 3
## ID Group diffWeight
## <fct> <fct> <dbl>
## 1 1 1 0.158
rat no. 1. One could remove it, but I don't think it will ruin the downstream analysis.
T-TEST
Let’s perform pairwise t-tests on the groups.
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.0.4
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
rats_summ %>% pairwise_t_test(
diffWeight ~ Group, paired = FALSE,
p.adjust.method = "fdr"
)
## # A tibble: 3 x 9
## .y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
## 1 diffWeight 1 2 8 4 0.0988 ns 0.148 ns
## 2 diffWeight 1 3 8 4 0.718 ns 0.718 ns
## 3 diffWeight 2 3 4 4 0.0858 ns 0.148 ns
Despite our observations, the differences are non-significant for all pairs. One could increase the sample sizes or pick a better summary statistic. As expected, the smallest p-value is between groups 2 and 3, which we identified as the fastest- and slowest-growing groups. Let's make a final attempt with linear model fitting:
fit = rats %>% summarize(Group = unique(Group), baseline = stdWeight[1], weightGain = stdWeight[Time == "WD64"]) %>%
{lm(weightGain ~ baseline + Group, data = .)}
anova(fit)
## Analysis of Variance Table
##
## Response: weightGain
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 0.0095991 0.0047996 2.1004 0.162
## Residuals 13 0.0297058 0.0022851
This is still not significant, but I was not expecting significance: the techniques are very similar in these two cases, and the baseline has already been taken into account during normalization.
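Given the small group sizes (n = 4), a rank-based check would be a reasonable complement to the t-tests; a sketch using rstatix's pairwise Wilcoxon wrapper on the same summary data:

```r
# Non-parametric analogue of the pairwise t-tests above
rats_summ %>%
  pairwise_wilcox_test(diffWeight ~ Group, p.adjust.method = "fdr")
```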
In this exercise part we’ll deal with the the BPRS (brief psychological response scale) measurements.
bprs = read_csv('data/bprs_long.csv')
##
## -- Column specification --------------------------------------------------------
## cols(
## treatment = col_double(),
## subject = col_double(),
## Week = col_character(),
## BPRS = col_double()
## )
str(bprs)
## spec_tbl_df [360 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ treatment: num [1:360] 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : num [1:360] 1 1 1 1 1 1 1 1 1 2 ...
## $ Week : chr [1:360] "week0" "week1" "week2" "week3" ...
## $ BPRS : num [1:360] 42 36 36 43 41 40 38 47 51 58 ...
## - attr(*, "spec")=
## .. cols(
## .. treatment = col_double(),
## .. subject = col_double(),
## .. Week = col_character(),
## .. BPRS = col_double()
## .. )
The bprs data contains weekly measurements (over 9 weeks including the start of the experiment) of the BPRS of 40 individuals categorized into 2 treatment groups. This leads altogether to 40x9 = 360 observations. I will use the words treatment and group interchangeably.
Let’s create the factor variables again.
bprs %<>% mutate(across(c(subject,treatment,Week),as_factor))
# Double check that the Week variable is increasing in the levels
levels(bprs$Week)
## [1] "week0" "week1" "week2" "week3" "week4" "week5" "week6" "week7" "week8"
Let’s visualize the data on graphs
bprs %>% ggplot() +
aes(x = Week, y = BPRS, color = subject, group = subject) +
geom_line() +
facet_wrap(~treatment) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
As described in the book, the BPRS measurements decrease steadily over time. Let's first fit an ordinary regression model out of the box (without normalization), ignoring for now that repeated measurements on the same subject are not independent.
BASIC LINEAR MODEL
bprs_reg = lm(BPRS ~ Week + treatment, data = bprs)
summary(bprs_reg)
##
## Call:
## lm(formula = BPRS ~ Week + treatment, data = bprs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.714 -9.226 -2.989 6.786 48.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.7139 2.0590 23.173 < 2e-16 ***
## Weekweek1 -1.6750 2.7624 -0.606 0.54467
## Weekweek2 -6.3000 2.7624 -2.281 0.02317 *
## Weekweek3 -8.8500 2.7624 -3.204 0.00148 **
## Weekweek4 -11.6500 2.7624 -4.217 3.15e-05 ***
## Weekweek5 -15.4500 2.7624 -5.593 4.51e-08 ***
## Weekweek6 -16.7750 2.7624 -6.073 3.28e-09 ***
## Weekweek7 -15.8000 2.7624 -5.720 2.29e-08 ***
## Weekweek8 -16.5750 2.7624 -6.000 4.92e-09 ***
## treatment2 0.5722 1.3022 0.439 0.66063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.35 on 350 degrees of freedom
## Multiple R-squared: 0.2026, Adjusted R-squared: 0.182
## F-statistic: 9.878 on 9 and 350 DF, p-value: 1.62e-13
The summary indicates that the treatment variable does not have a significant effect on the outcome. However, the passage of time makes a significant difference, as shown by the Week coefficients. This model assumed independent observations, which is probably not true here: the same subjects are measured repeatedly over time, so their measurements are correlated.
RANDOM INTERCEPT MODEL
library(lme4)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.0.5
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
bprs_ref <- lmer(BPRS ~ Week + treatment + (1 | subject), data = bprs, REML = FALSE)
summary(bprs_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ Week + treatment + (1 | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2751.3 2798.0 -1363.7 2727.3 348
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2309 -0.6213 -0.1018 0.4877 3.3537
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.6 6.899
## Residual 100.8 10.039
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.7139 2.2758 20.965
## Weekweek1 -1.6750 2.2448 -0.746
## Weekweek2 -6.3000 2.2448 -2.807
## Weekweek3 -8.8500 2.2448 -3.942
## Weekweek4 -11.6500 2.2448 -5.190
## Weekweek5 -15.4500 2.2448 -6.883
## Weekweek6 -16.7750 2.2448 -7.473
## Weekweek7 -15.8000 2.2448 -7.039
## Weekweek8 -16.5750 2.2448 -7.384
## treatment2 0.5722 1.0582 0.541
##
## Correlation of Fixed Effects:
## (Intr) Wekwk1 Wekwk2 Wekwk3 Wekwk4 Wekwk5 Wekwk6 Wekwk7 Wekwk8
## Weekweek1 -0.493
## Weekweek2 -0.493 0.500
## Weekweek3 -0.493 0.500 0.500
## Weekweek4 -0.493 0.500 0.500 0.500
## Weekweek5 -0.493 0.500 0.500 0.500 0.500
## Weekweek6 -0.493 0.500 0.500 0.500 0.500 0.500
## Weekweek7 -0.493 0.500 0.500 0.500 0.500 0.500 0.500
## Weekweek8 -0.493 0.500 0.500 0.500 0.500 0.500 0.500 0.500
## treatment2 -0.232 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
This model also indicates no significant effect of the treatment on the BPRS outcome.
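The variance components above also quantify how wrong the earlier independence assumption was; a sketch of the intraclass correlation, assuming the `bprs_ref` fit from the chunk above:

```r
# Share of total variance attributable to between-subject differences
vc <- as.data.frame(VarCorr(bprs_ref))
icc <- vc$vcov[1] / sum(vc$vcov)
icc  # roughly 47.6 / (47.6 + 100.8), i.e. about 0.32
```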
RANDOM SLOPE MODEL
bprs_refslope <- lmer(BPRS ~ Week + treatment + (Week | subject), data = bprs, REML = FALSE)
## boundary (singular) fit: see help('isSingular')
summary(bprs_refslope)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ Week + treatment + (Week | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2825.5 3043.1 -1356.7 2713.5 304
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0214 -0.5624 -0.0749 0.5371 3.3227
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 52.035 7.214
## Weekweek1 11.038 3.322 0.33
## Weekweek2 8.858 2.976 0.09 0.68
## Weekweek3 15.579 3.947 0.00 0.92 0.83
## Weekweek4 24.598 4.960 -0.45 0.66 0.37 0.78
## Weekweek5 30.384 5.512 -0.58 0.47 0.12 0.58 0.96
## Weekweek6 43.630 6.605 -0.53 0.49 0.08 0.58 0.96 1.00
## Weekweek7 50.055 7.075 -0.41 0.57 0.09 0.61 0.95 0.98
## Weekweek8 52.016 7.212 -0.33 0.64 0.15 0.66 0.96 0.96
## Residual 90.726 9.525
##
##
##
##
##
##
##
##
## 0.99
## 0.98 1.00
##
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.7139 2.2632 21.083
## Weekweek1 -1.6750 2.2557 -0.743
## Weekweek2 -6.3000 2.2314 -2.823
## Weekweek3 -8.8500 2.3055 -3.839
## Weekweek4 -11.6500 2.4013 -4.852
## Weekweek5 -15.4500 2.4608 -6.278
## Weekweek6 -16.7750 2.5919 -6.472
## Weekweek7 -15.8000 2.6531 -5.955
## Weekweek8 -16.5750 2.6715 -6.204
## treatment2 0.5722 1.0040 0.570
##
## Correlation of Fixed Effects:
## (Intr) Wekwk1 Wekwk2 Wekwk3 Wekwk4 Wekwk5 Wekwk6 Wekwk7 Wekwk8
## Weekweek1 -0.367
## Weekweek2 -0.430 0.517
## Weekweek3 -0.434 0.552 0.535
## Weekweek4 -0.566 0.519 0.474 0.548
## Weekweek5 -0.614 0.486 0.431 0.512 0.606
## Weekweek6 -0.601 0.479 0.406 0.505 0.616 0.640
## Weekweek7 -0.552 0.491 0.400 0.510 0.618 0.640 0.667
## Weekweek8 -0.519 0.504 0.407 0.522 0.620 0.636 0.663 0.678
## treatment2 -0.222 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
anova(bprs_ref,bprs_refslope)
## Data: bprs
## Models:
## bprs_ref: BPRS ~ Week + treatment + (1 | subject)
## bprs_refslope: BPRS ~ Week + treatment + (Week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_ref 12 2751.3 2798.0 -1363.7 2727.3
## bprs_refslope 56 2825.5 3043.1 -1356.7 2713.5 13.837 44 1
Even adding the time variable to the random effects does not reveal a difference between the groups, and the likelihood-ratio test above does not favor the random slope model either.
RANDOM INTERCEPT AND SLOPE WITH INTERACTION MODEL
bprs_refslopeinteract <- lmer(BPRS ~ Week * treatment + (Week | subject), data = bprs, REML = FALSE)
## boundary (singular) fit: see help('isSingular')
summary(bprs_refslopeinteract)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ Week * treatment + (Week | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2832.7 3081.4 -1352.3 2704.7 296
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1490 -0.5778 -0.0824 0.5461 3.4514
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 53.24 7.296
## Weekweek1 11.78 3.432 0.28
## Weekweek2 10.49 3.239 0.03 0.70
## Weekweek3 17.74 4.212 -0.05 0.92 0.85
## Weekweek4 26.40 5.138 -0.47 0.68 0.42 0.79
## Weekweek5 31.65 5.626 -0.59 0.49 0.16 0.60 0.96
## Weekweek6 45.18 6.722 -0.54 0.50 0.12 0.58 0.95 1.00
## Weekweek7 51.65 7.187 -0.42 0.58 0.13 0.61 0.95 0.98
## Weekweek8 53.79 7.334 -0.35 0.65 0.18 0.66 0.95 0.96
## Residual 88.10 9.386
##
##
##
##
##
##
##
##
## 0.99
## 0.98 1.00
##
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.000 2.658 17.680
## Weekweek1 -0.200 3.066 -0.065
## Weekweek2 -3.450 3.055 -1.129
## Weekweek3 -6.100 3.114 -1.959
## Weekweek4 -10.400 3.183 -3.268
## Weekweek5 -14.300 3.224 -4.436
## Weekweek6 -17.300 3.327 -5.200
## Weekweek7 -17.200 3.375 -5.096
## Weekweek8 -17.700 3.391 -5.220
## treatment2 2.000 2.968 0.674
## Weekweek1:treatment2 -2.950 4.197 -0.703
## Weekweek2:treatment2 -5.700 4.197 -1.358
## Weekweek3:treatment2 -5.500 4.197 -1.310
## Weekweek4:treatment2 -2.500 4.197 -0.596
## Weekweek5:treatment2 -2.300 4.197 -0.548
## Weekweek6:treatment2 1.050 4.197 0.250
## Weekweek7:treatment2 2.800 4.197 0.667
## Weekweek8:treatment2 2.250 4.197 0.536
##
## Correlation matrix not shown by default, as p = 18 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
anova(bprs_refslope,bprs_refslopeinteract)
## Data: bprs
## Models:
## bprs_refslope: BPRS ~ Week + treatment + (Week | subject)
## bprs_refslopeinteract: BPRS ~ Week * treatment + (Week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_refslope 56 2825.5 3043.1 -1356.7 2713.5
## bprs_refslopeinteract 64 2832.7 3081.4 -1352.3 2704.7 8.7954 8 0.3599
Based on the Chi-squared (likelihood-ratio) test, the interaction model does not fit significantly better than the random slope model (p = 0.36). Of the models considered it has the highest log-likelihood, but it is still far from ideal. I conclude that there is no detectable difference between the treatment groups.
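As a closing check, the fit of the final model can be inspected visually; a sketch assuming the `bprs` tibble and the `bprs_refslopeinteract` fit from above:

```r
# Observed (grey) vs model-fitted (blue) BPRS trajectories per subject
bprs %>%
  mutate(Fitted = fitted(bprs_refslopeinteract)) %>%
  ggplot(aes(x = Week, group = subject)) +
  geom_line(aes(y = BPRS), color = "grey60") +
  geom_line(aes(y = Fitted), color = "steelblue") +
  facet_wrap(~treatment)
```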